Accelerator, method of operating the accelerator, and device including the accelerator

ABSTRACT

A method of operating an accelerator includes receiving, from a central processing unit (CPU), commands for the accelerator and a peripheral device of the accelerator, processing the received commands according to a subject of performance of each of the commands, and transmitting a completion message indicating that performance of the commands is completed to the CPU after the performance of the commands is completed.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2019-0172348 filed on Dec. 20, 2019, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

BACKGROUND

1. Field

The following description relates to an accelerator, a method of operating the accelerator, and a device including the accelerator.

2. Description of Related Art

A machine learning application including a deep neural network (DNN) may involve numerous operations that each require a large amount of computation or memory. The machine learning application may thus use a large amount of resources, and there is a desire for an improved technology for neural network computation.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

In one general aspect, a method of operating an accelerator includes receiving, from a central processing unit (CPU), commands for the accelerator and a peripheral device of the accelerator, processing the received commands according to a subject of performance of each of the commands, and transmitting a completion message indicating completion of performance of the commands to the CPU after the performance of the commands is completed.

In a case in which a subject to perform a target command, from among the commands, is the accelerator, the processing of the commands may include performing the target command in the accelerator. In a case in which the subject to perform the target command is the peripheral device, the processing of the commands may include transmitting the target command to the peripheral device.

In the case in which the subject to perform the target command is the peripheral device, after the target command is transmitted from the accelerator to the peripheral device, the target command may be performed in the peripheral device without intervention of the CPU.

In the case in which the subject to perform the target command is the peripheral device, after the target command transmitted from the accelerator is performed in the peripheral device, a completion message indicating completion of the performance of the target command may be transmitted from the peripheral device to the accelerator.

After performance of a first command, from among the commands, is completed, the processing of the commands may include processing a second command, which is a subsequent command of the first command, according to a corresponding subject of performance of the second command.

The accelerator may be included in a device together with the CPU, and configured to perform a neural network-based inference among processes that are capable of being processed in the device.

At least one command for the accelerator, from among the commands, may be a command for performing at least one neural network-based operation in the accelerator.

The commands received from the CPU may be stored in a command queue included in the accelerator based on one or both of a dependency and a performance order.

In a case in which a subject to perform a target command to be processed, from among the commands, is a peripheral device configured to communicate with another device, the processing of the commands may include transmitting, to the peripheral device, a processing result of the accelerator together with the target command. The target command may include connection information associated with a connection with the other device.

In a case in which a subject to perform a target command to be processed, from among the commands, is a storage device, the processing of the commands may include transmitting, to the storage device, the target command including information indicating whether to read data from the storage device or write data in the storage device.

The accelerator, the CPU, and the peripheral device may transmit or receive a data signal and/or a control signal through peripheral component interconnect express (PCIe) communication in the same device.

In another general aspect, an accelerator includes a core controller configured to transmit, to an accelerator controller or a peripheral device controller, a target command to be performed, from among commands received from a CPU and stored in a command queue according to a subject of performance of each command, the accelerator controller configured to receive, from the core controller, the target command in a case in which the accelerator is a subject to perform the target command and to perform the received target command, and the peripheral device controller configured to receive, from the core controller, the target command in a case in which a peripheral device is the subject to perform the target command and to transmit the received target command to the peripheral device. After performance of the commands stored in the command queue is completed, the core controller may transmit, to the CPU, a completion message indicating the completion of the performance of the commands.

In another general aspect, a device includes a CPU configured to transmit, to an accelerator, commands for the accelerator and a peripheral device, the accelerator configured to perform a target command, from among the commands, in a case in which the target command is to be performed by the accelerator, and transmit, to the peripheral device, the target command in a case in which the target command is to be performed by the peripheral device, and the peripheral device configured to perform the target command received from the accelerator in the case in which the target command is to be performed by the peripheral device. After performance of the commands is completed, the accelerator may transmit, to the CPU, a completion message indicating the completion of the performance of the commands.

In another general aspect, an accelerator includes one or more processors to: receive, from a central processing unit (CPU), one or more commands to be processed; determine, for each of the commands, whether the respective command is to be processed by the one or more processors or to be processed by a peripheral device; in a case in which it is determined that the respective command is to be processed by the one or more processors, process the respective command; in a case in which it is determined that the respective command is to be processed by the peripheral device, transmit the command to the peripheral device; and transmit a completion message to the CPU only after it is confirmed that all of the commands have been processed.

The one or more processors of the accelerator may determine, for each of the commands, whether the respective command is to be processed by the one or more processors or to be processed by the peripheral device based on whether the respective command requires performance of a neural network-based inference or calculation of a gradient for neural network learning.

The one or more processors of the accelerator may transmit, in the case in which it is determined that the respective command is to be processed by the peripheral device, a processing result of the one or more processors to the peripheral device, in a case in which it is determined that the processing result is needed for the peripheral device to perform the respective command.

The one or more processors of the accelerator may transmit the completion message to the CPU after receiving a command completion message from the peripheral device indicating that all of the commands to be processed by the peripheral device have been processed by the peripheral device.

Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a and 1b are diagrams illustrating an example of a device.

FIG. 2 is a diagram illustrating an example of an operation of each of a central processing unit (CPU), an accelerator, and a peripheral device.

FIGS. 3, 4, 5, and 6 are diagrams illustrating examples of how an accelerator transmits a command directly to a peripheral device.

FIGS. 7 and 8 are diagrams illustrating examples of a device.

FIG. 9 is a flowchart illustrating an example of a method of operating an accelerator.

Throughout the drawings and the detailed description, unless otherwise described or provided, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.

DETAILED DESCRIPTION

The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order.

The features described herein may be embodied in different forms and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.

Although terms such as “first,” “second,” and “third” may be used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Rather, these terms are only used to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. Thus, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.

Throughout the specification, when a component is described as being “connected to,” or “coupled to” another component, it may be directly “connected to,” or “coupled to” the other component, or there may be one or more other components intervening therebetween. In contrast, when an element is described as being “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween. Likewise, similar expressions, for example, “between” and “immediately between,” and “adjacent to” and “immediately adjacent to,” are also to be construed in the same way. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items.

The terminology used herein is for describing various examples only and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms “comprises,” “includes,” and “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.

Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains and based on an understanding of the disclosure of the present application. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the disclosure of the present application and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Also, in the description of example embodiments, detailed description of structures or functions that are thereby known after an understanding of the disclosure of the present application will be omitted when it is deemed that such description will cause ambiguous interpretation of the example embodiments.

Hereinafter, examples will be described in detail with reference to the accompanying drawings, and like reference numerals in the drawings refer to like elements throughout.

FIGS. 1a and 1b are diagrams illustrating an example of a device.

Referring to FIG. 1a, a device 100 includes a central processing unit (CPU) 110, an accelerator 120, a peripheral device 130, and a dynamic random-access memory (DRAM) 140.

In the example of FIG. 1a, when a destination of a result of a neural processing unit (NPU) operation to be performed in the accelerator 120 is the peripheral device 130, the CPU 110 may transmit a command for performing the NPU operation to the accelerator 120 through a control signal path. The accelerator 120 may perform the NPU operation upon the command, and store a result of performing the NPU operation in the DRAM 140 through a data path. The accelerator 120 may then transmit a completion message indicating completion of the performance of the command to the CPU 110 through a completion signal path. When the CPU 110 receives the completion message from the accelerator 120, the CPU 110 may transmit a command for reading the result of the NPU operation to the peripheral device 130 through a control signal path. The peripheral device 130 may then read the result of the NPU operation from the DRAM 140 upon the command of the CPU 110. For example, when the peripheral device 130 is a network interface device configured to communicate with another device, the peripheral device 130 may transmit the result of the NPU operation to the other device. As another example, when the peripheral device 130 is a storage device, the peripheral device 130 may store the result of the NPU operation. The peripheral device 130 may then transmit a completion message to the CPU 110 through a completion signal path.

As described above with reference to FIG. 1a, the CPU 110 may transmit a command to a device, and then transmit a next command to another device in sequential order only after a completion message indicating completion of performance of the command is received. Thus, in such an example, frequent interventions of the CPU 110 may be required, resulting in an increase in latency. To minimize such unnecessary latency and improve throughput, there is provided a structure illustrated in FIG. 1b.

Referring to FIG. 1b, a device 1000 includes a CPU 1110, an accelerator 1120, and a peripheral device 1130.

The CPU 1110 may be a universal processor configured to process a processing operation performed in the device 1000. The CPU 1110 may generate a command for another device included in the device 1000, for example, the accelerator 1120 and the peripheral device 1130, thereby controlling an operation of the other device. In an example, the CPU 1110 may transmit, to the accelerator 1120, commands for the accelerator 1120 and the peripheral device 1130. That is, the CPU 1110 may transmit, to the accelerator 1120, a command for the peripheral device 1130 in addition to a command for the accelerator 1120, rather than transmitting the command for the peripheral device 1130 directly to the peripheral device 1130. These commands may be transmitted from the CPU 1110 to the accelerator 1120 through a control signal path.

The accelerator 1120 may be a dedicated processor configured to process a processing operation that is more effectively performed in the dedicated processor rather than the universal processor, for example, the CPU 1110. For example, the accelerator 1120 may perform inference of input data by performing at least one operation based on a deep neural network (DNN). The accelerator 1120 may calculate a gradient to determine parameters of nodes included in a neural network. For example, the accelerator 1120 may include an NPU, a graphics processing unit (GPU), a tensor processing unit (TPU), and the like.

In an example, the accelerator 1120 may receive the commands for the accelerator 1120 and the peripheral device 1130 from the CPU 1110 through the control signal path. The accelerator 1120 may process the received commands according to a subject of performance of each of the commands. For example, when a subject to perform a target command among the received commands is the accelerator 1120, the accelerator 1120 may directly perform the target command. However, when a subject to perform a target command among the received commands is the peripheral device 1130, the accelerator 1120 may transmit the target command to the peripheral device 1130. The target command may be transmitted from the accelerator 1120 to the peripheral device 1130 through a control signal path between the accelerator 1120 and the peripheral device 1130. When an operation result processed and obtained from the accelerator 1120 is needed for the peripheral device 1130 to perform an operation involved with the target command, the operation result of the accelerator 1120 may be transmitted from the accelerator 1120 to the peripheral device 1130 through a data path.

The peripheral device 1130 may be a component included in the device 1000 along with the CPU 1110 and the accelerator 1120, and be embodied by various devices except the CPU 1110 and the accelerator 1120. For example, the peripheral device 1130 may include an interface device such as a network interface card (NIC), a storage device such as a solid-state drive (SSD), and the like. The peripheral device 1130 may perform an operation involved with the command received from the accelerator 1120. When the performance of the command is completed, a completion message indicating the completion of the performance of the command may be transmitted from the peripheral device 1130 to the accelerator 1120 through a completion signal path. When a processing result needs to be transmitted from the peripheral device 1130 to the accelerator 1120, the processing result may be transmitted from the peripheral device 1130 to the accelerator 1120 through the data path.

Although one peripheral device, for example, the peripheral device 1130, is illustrated in FIG. 1b for the convenience of description, a plurality of peripheral devices may be included in the device 1000 and the following description may be applied to the accelerator 1120 and each of the peripheral devices included in the device 1000.

When the commands transmitted from the CPU 1110 are all performed by the accelerator 1120 and/or the peripheral device 1130, the accelerator 1120 may transmit a completion message indicating completion of the performance of the commands to the CPU 1110. For the transmission of the completion message, a completion signal path may be used.

For example, when a destination of a result of an NPU operation is the peripheral device 1130, the following operations may be performed in the structure illustrated in FIG. 1b. The CPU 1110 may transmit all commands for the accelerator 1120 and the peripheral device 1130 to the accelerator 1120 through the control signal path. The NPU operation may be performed in the accelerator 1120 upon a first command among the commands. When a second command among the commands is for the peripheral device 1130, the accelerator 1120 may transmit the second command to the peripheral device 1130 through the control signal path. The peripheral device 1130 may then perform an operation involved with the second command. For example, the peripheral device 1130 may read a result of the NPU operation from the accelerator 1120 and transmit the result to another device. In this example, the result of the NPU operation may be transmitted from the accelerator 1120 to the peripheral device 1130 through the data path. The peripheral device 1130 may then transmit a completion message to the accelerator 1120 through the completion signal path. When the accelerator 1120 verifies that there are no more commands to be performed, the accelerator 1120 may transmit a completion message to the CPU 1110 through the completion signal path.

The control/completion signal path illustrated in FIG. 1b may include the control signal path or the completion signal path. Through the control signal path, a command may be transmitted. Through the completion signal path, a completion message may be transmitted. The data path may be a path through which data needed to perform a command or result data obtained by performing the command, for example, a result of an NPU operation, other than a command and a completion message, is transmitted.

Unlike the structure described above with reference to FIG. 1a, the CPU 1110 may transmit all commands for both the accelerator 1120 and the peripheral device 1130 to the accelerator 1120, and the accelerator 1120 may then transmit a completion message to the CPU 1110 when all the commands are performed in the accelerator 1120 and/or the peripheral device 1130. Thus, intervention of the CPU 1110 may be minimized, and latency may thus be reduced effectively.
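
The difference in CPU involvement between the two structures may be illustrated by counting the control and completion messages that pass through the CPU. The following Python sketch is a simplified model and not part of the disclosure: it assumes that each command costs one control message and one completion message, and that in the structure of FIG. 1b all commands are handed over in a single batch. The function names are hypothetical.

```python
def cpu_messages_fig_1a(num_commands: int) -> int:
    """FIG. 1a: the CPU issues every command itself and waits for each completion."""
    # One control message sent and one completion message received per command.
    return 2 * num_commands


def cpu_messages_fig_1b(num_commands: int) -> int:
    """FIG. 1b: the CPU hands the whole batch of commands to the accelerator once."""
    # Assumption: one batched control message and one final completion message,
    # regardless of how many commands the batch contains.
    return 2 if num_commands > 0 else 0


if __name__ == "__main__":
    # One NPU command followed by one peripheral command, as in the example above.
    print(cpu_messages_fig_1a(2), cpu_messages_fig_1b(2))
```

Under these assumptions the number of messages handled by the CPU grows with the number of commands in FIG. 1a but stays constant in FIG. 1b.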

FIG. 2 is a diagram illustrating an example of an operation of each of a CPU, an accelerator, and a peripheral device.

In the example of FIG. 2, how a CPU, an accelerator, and a peripheral device operate is illustrated in the form of a flowchart.

Referring to FIG. 2, in operation 210, the CPU transmits, to the accelerator, commands for the accelerator and the peripheral device. That is, in addition to at least one command for the accelerator, at least one command for the peripheral device may also be transmitted to the accelerator.

In operation 220, the accelerator stores the received commands in a command queue. The commands may be stored in the command queue based on dependency and/or a performance order. Thus, it is possible to prevent a next command from being performed before a current command is completely performed, and thus the commands may be performed based on the preset dependency and/or the performance order.
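
One possible way to derive a performance order from dependencies is a topological sort. The following Python sketch illustrates that general technique and is not the ordering mechanism of the disclosure, which does not specify one. The command identifiers and the `order_commands` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def order_commands(dependencies: dict[str, set[str]]) -> list[str]:
    """Return command IDs so that each command follows the commands it depends on.

    `dependencies` maps a command ID to the set of command IDs that must be
    completely performed before it (hypothetical representation).
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # A convolution, then a ReLU on its output, then sending the result out.
    deps = {"conv": set(), "relu": {"conv"}, "nic_send": {"relu"}}
    print(order_commands(deps))
```

Storing the commands in the resulting order means that a command is only reached after every command it depends on has already been taken from the queue.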

In operation 230, the accelerator identifies a target command to be performed from the command queue. To identify the target command to be performed, the dependency and/or the performance order of the commands stored in the command queue may be considered.

In operation 240, the accelerator verifies a subject of performance of the target command. For example, the accelerator may verify whether the target command is associated with the accelerator or the peripheral device. When the subject of the performance of the target command is the accelerator, operation 250 may be performed. When the subject of the performance of the target command is the peripheral device, operation 260 may be performed.

In operation 250, the accelerator performs the target command associated with the accelerator. For example, upon the target command, the accelerator may perform at least one operation, for example, a convolution operation, a rectified linear unit (ReLU) operation, and a sigmoid operation, that is defined in a neural network. That is, the accelerator may perform neural network-based inference or calculate a gradient for neural network learning, according to the target command.

In operation 260, the accelerator transmits the target command to the peripheral device. The target command may be associated with the peripheral device. According to an example, a processing result of the accelerator that is needed for the peripheral device to perform the target command may be additionally transmitted to the peripheral device.

In operation 261, the peripheral device performs the target command associated with the peripheral device. For example, when the peripheral device is an interface device configured to communicate with another device, the peripheral device may transmit the processing result of the accelerator to the other device. As another example, when the peripheral device is a storage device, the peripheral device may read data from the storage device or write the processing result of the accelerator in the storage device.

In operation 262, when the performance of the target command is completed in the peripheral device, the peripheral device transmits a command completion message to the accelerator, not to the CPU.

In operation 270, when the performance of the target command is completed in the accelerator or the peripheral device, the accelerator verifies whether there is a next command. For example, the accelerator may verify whether there is a command in the command queue that is yet to be performed. When there is a next command or a command yet to be executed, operation 230 may be performed and the next command may be performed according to a corresponding subject of performance of the next command. When there is no next command or command yet to be executed, operation 280 may be performed.

In operation 280, when performance of all the commands received from the CPU and stored in the command queue is completed, the accelerator transmits a command completion message to the CPU.
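
The flow of operations 210 through 280 may be sketched in software as follows. This is a minimal Python illustration under stated assumptions, not the hardware implementation of the disclosure: the `Command`, `Peripheral`, and `Accelerator` classes, the string encoding of the subject of performance, and the placeholder results are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    cmd_id: int
    subject: str  # "accelerator" or "peripheral" (hypothetical encoding)
    payload: str


class Peripheral:
    """Stand-in for a peripheral device such as an NIC or a storage device."""

    def perform(self, command: Command, data=None) -> str:
        # Operations 261 and 262: perform the command, then report completion
        # to the accelerator rather than to the CPU.
        return f"done:{command.cmd_id}"


class Accelerator:
    def __init__(self, peripheral: Peripheral):
        self.queue = deque()
        self.peripheral = peripheral
        self.last_result = None
        self.log = []

    def receive(self, commands) -> str:
        """Operation 210 delivers the commands; the return value models operation 280."""
        self.queue.extend(commands)                 # operation 220
        while self.queue:                           # operation 270
            target = self.queue.popleft()           # operation 230
            if target.subject == "accelerator":     # operation 240
                # Operation 250: placeholder for a neural network operation.
                self.last_result = f"result({target.payload})"
                self.log.append(("accelerator", target.cmd_id))
            else:
                # Operation 260: forward the command and any needed result.
                ack = self.peripheral.perform(target, self.last_result)
                self.log.append(("peripheral", target.cmd_id, ack))
        return "all commands completed"             # operation 280


if __name__ == "__main__":
    acc = Accelerator(Peripheral())
    print(acc.receive([Command(1, "accelerator", "conv"),
                       Command(2, "peripheral", "send")]))
```

In this sketch the CPU is represented only by the caller of `receive`, which reflects that the CPU is involved once when the commands are handed over and once when the final completion message is returned.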

As described above, all commands associated with both the accelerator and the peripheral device may be transmitted from the CPU to the accelerator, and a command completion message may be transmitted from the accelerator to the CPU when performance of all the commands is completed, rather than the CPU intervening and giving a next command each time performance of each command is completed by the accelerator or the peripheral device. Thus, it is possible to minimize latency that may occur due to the intervention of the CPU and reduce a load on the CPU, thereby enabling faster processing. Through the structure that may minimize the intervention of the CPU, it is possible to remove the participation of hardware except for the accelerator and the peripheral device, and effectively reduce the latency. Thus, it is possible to improve throughput, and effectively reduce the size of a DRAM needed to store an operation result of the accelerator in an internal memory of the accelerator.

FIGS. 3 through 6 are diagrams illustrating examples of how an accelerator transmits a command directly to a peripheral device.

Referring to FIG. 3, a peripheral device 330 may receive a command from an accelerator 320 and perform the received command.

A CPU 310 may transmit, to the accelerator 320, commands associated with the accelerator 320 and the peripheral device 330. The commands received by the accelerator 320 may be stored in a command queue 321 based on dependency and/or a performance order. In the command queue 321, Cmnd ID indicates an identifier of a command, and Cmnd indicates the content of a command. Next Cmnd indicates whether a next command is present or not, or an identifier of the next command. In the example of FIG. 3, Cmnd indicates a command.
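
The three fields of the command queue 321 may be pictured as rows that are chained through the Next Cmnd field. The following Python sketch assumes, for illustration only, that Next Cmnd holds the identifier of the next command or nothing when no next command is present, and that the chain contains no cycles. The `QueueRow` class and the `walk` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueRow:
    cmnd_id: int              # Cmnd ID: identifier of the command
    cmnd: str                 # Cmnd: content of the command
    next_cmnd: Optional[int]  # Next Cmnd: next identifier, or None if absent


def walk(rows: dict[int, QueueRow], first_id: int) -> list[str]:
    """Follow the Next Cmnd links from `first_id` and return the command contents in order."""
    order = []
    current: Optional[int] = first_id
    while current is not None:
        row = rows[current]
        order.append(row.cmnd)
        current = row.next_cmnd
    return order


if __name__ == "__main__":
    rows = {1: QueueRow(1, "Conv", 2), 2: QueueRow(2, "ReLU", None)}
    print(walk(rows, 1))
```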

A core controller 323 may identify a target command that needs to be performed at a current time from the command queue 321, verify a subject of performance of the identified target command, and transmit the target command to an accelerator core 325 or a peripheral device controller 327. When the subject to perform the target command is the accelerator 320, the core controller 323 may transmit the target command to the accelerator core 325. When the subject to perform the target command is the peripheral device 330, the core controller 323 may transmit the target command to the peripheral device controller 327.

The accelerator core 325 may perform the target command received from the core controller 323. For example, the accelerator core 325 may perform an inference operation in response to input data by performing an operation based on a neural network. A result of the operation performed by the accelerator core 325 may be stored in an internal memory 329. In the internal memory 329, a hatched portion may indicate a portion in which data is stored, and an empty portion may indicate a portion in which data is not stored and is thus available for a new operation result to be stored therein.

The peripheral device controller 327 may transmit the target command to the peripheral device 330 such that the target command received from the core controller 323 is performed by the peripheral device 330. According to an example, when the operation result of the accelerator core 325 is needed for the peripheral device 330 to perform the target command, the operation result stored in the internal memory 329 may be transmitted to the peripheral device 330.

The peripheral device 330 may perform the target command received from the peripheral device controller 327. When the performance of the target command is completed, the peripheral device 330 may transmit a command completion message to the peripheral device controller 327 to notify that the performance of the target command is completed.

Hereinafter, examples will be described in greater detail.

Referring to FIG. 4, an NIC 430 may receive a command from an accelerator 420 and perform the received command. The NIC 430 may be an example of an interface device configured to communicate with another device. However, examples are not limited to the NIC 430, and other various interface and/or communication devices may be applied.

A result of an operation performed by the accelerator 420, for example, a gradient, may be transmitted immediately to the NIC 430 without intervention or help of a CPU 410 and then transmitted to another device, for example, another server, through a network. Such a method may enable rapid data sharing, which will be described in detail hereinafter.

When there is a need to transmit such an operation result of the accelerator 420 to an outside of a device, the CPU 410 may transmit, to the accelerator 420, a command for the NIC 430 in addition to a command for the accelerator 420. For this, a bit indicating whether there is an associated NIC command may be separately present.

The commands received from the CPU 410 may be stored in a command queue 421. In the example of FIG. 4, in the command queue 421, an NIC command that needs to be performed after an accelerator command may be stored as a pair in the same row as the accelerator command. For example, for a first command, no NIC command needs to be performed after the accelerator command, and a second command may simply be performed next. For the second command, however, an NIC command is stored after the accelerator command, which indicates that the NIC command needs to be performed after the accelerator command.

When a core controller 423 verifies that a command to be performed is associated with the accelerator 420 by referring to the command queue 421, the core controller 423 may transmit such an accelerator command to an accelerator core 425 based on availability of the accelerator core 425. For example, when the availability of the accelerator core 425 is low, the core controller 423 may transmit the accelerator command to the accelerator core 425 after a preset amount of time elapses or when the availability becomes greater than or equal to a threshold rate.

The accelerator core 425 may perform an operation according to thereceived accelerator command. For example, when Conv(z, w) is receivedas the accelerator command, the accelerator core 425 may perform aconvolution operation on (z, w). When performance of the operation iscompleted, a result of performing the operation may be stored in aninternal memory, and the accelerator core 425 may notify the corecontroller 423 of the completion of the performance of the command.

When the core controller 423 verifies that there is an associated NIC command by referring to the command queue 421, the core controller 423 may transmit the NIC command to an NIC controller 427. The NIC command may include connection information associated with a network connection between the device and another device, for example, a counterpart server, to which the operation result needs to be transmitted; information associated with data of the operation result to be transmitted, for example, a physical address at which the operation result is stored in the internal memory and a length of the data of the operation result; and the like. For example, when the device is connected to the other device according to an Internet Protocol version 4 (IPv4) transmission control protocol (TCP), the NIC command may include an IP address of each of the device and the other device, port information, TCP ack/seq. numbers, and the like.
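As a non-limiting illustration, the information carried by an NIC command for an IPv4 TCP connection may be sketched as follows. The class name NicCommand, all field names, and the example values (addresses from the 192.0.2.0/24 documentation range) are hypothetical; the description lists the kinds of information but does not define a concrete layout.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class NicCommand:
    """Hypothetical layout of an NIC command for an IPv4/TCP connection."""
    src_ip: IPv4Address     # IP address of the device
    dst_ip: IPv4Address     # IP address of the other device (counterpart server)
    src_port: int           # port information
    dst_port: int
    seq: int                # TCP sequence number
    ack: int                # TCP acknowledgment number
    result_addr: int        # physical address of the operation result in the internal memory
    result_len: int         # length of the operation result data, in bytes

    def __post_init__(self) -> None:
        # Basic sanity checks; a real implementation would validate far more.
        for port in (self.src_port, self.dst_port):
            if not 0 <= port <= 0xFFFF:
                raise ValueError(f"port out of range: {port}")
        if self.result_len <= 0:
            raise ValueError("result_len must be positive")


cmd = NicCommand(
    src_ip=IPv4Address("192.0.2.1"),
    dst_ip=IPv4Address("192.0.2.2"),
    src_port=40000,
    dst_port=5000,
    seq=0,
    ack=0,
    result_addr=0x1000,
    result_len=4096,
)
```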

That is, by referring to the command queue 421, it is possible to verify whether the operation result needs to be transmitted to an outside of the device after the accelerator core 425 performs the corresponding command, and verify where to transmit the operation result.

As described above, the core controller 423 may read a command from the command queue 421, and distribute a corresponding command to the accelerator core 425 and the NIC controller 427.

The NIC controller 427 may be a logic installed in the accelerator 420 to control the NIC 430. In an example, the NIC controller 427 may include a transmission queue and a completion queue to control the NIC 430. To transmit a command to the NIC 430, the NIC controller 427 may input the command to the transmission queue and then write a value in a doorbell, for example, an internal register, of the NIC 430. The NIC 430 may then read the command from the transmission queue to perform an operation, for example, transmit an operation result to another device, and then write a corresponding result in the completion queue. The core controller 423 may then determine whether performance of the command is completed or not based on the completion queue. In the example of FIG. 4, an NIC command 1.2.3.4/12 stored in the command queue 421 may be information needed to generate a TCP header for TCP communication. A plurality of pairs of the transmission queue and the completion queue may be present for a single NIC 430. In such a case, various accelerators may use the NIC 430 simultaneously.
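As a non-limiting illustration, the interaction through the transmission queue, the doorbell, and the completion queue may be simulated in software as follows. The class names SimulatedNic and NicController, the method names, and the "done" status string are hypothetical; an actual NIC is hardware that reads the queues on its own, whereas this sketch drains the transmission queue synchronously when the doorbell is written.

```python
from collections import deque


class SimulatedNic:
    """Software stand-in for the NIC: reacts to a doorbell write by draining the transmission queue."""

    def __init__(self, tx_queue: deque, completion_queue: deque) -> None:
        self.tx_queue = tx_queue
        self.completion_queue = completion_queue
        self.doorbell = 0      # models the internal register of the NIC
        self.sent = []         # commands "transmitted to another device"

    def ring_doorbell(self, value: int) -> None:
        self.doorbell = value
        while self.tx_queue:
            cmd = self.tx_queue.popleft()               # read the command from the transmission queue
            self.sent.append(cmd)                       # perform the operation
            self.completion_queue.append((cmd, "done"))  # write the result in the completion queue


class NicController:
    """Owns one transmission/completion queue pair and controls the NIC through it."""

    def __init__(self) -> None:
        self.tx_queue = deque()
        self.completion_queue = deque()
        self.nic = SimulatedNic(self.tx_queue, self.completion_queue)
        self._posted = 0

    def submit(self, cmd: str) -> None:
        self.tx_queue.append(cmd)              # input the command to the transmission queue
        self._posted += 1
        self.nic.ring_doorbell(self._posted)   # then write a value in the doorbell

    def poll_completion(self):
        """Return the oldest completion entry, or None if nothing has completed."""
        return self.completion_queue.popleft() if self.completion_queue else None


controller = NicController()
controller.submit("1.2.3.4/12")
first = controller.poll_completion()
second = controller.poll_completion()
```

Giving each accelerator its own NicController instance in such a sketch would correspond to the case in which a plurality of queue pairs is present for a single NIC.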

When all the commands stored in the command queue 421 are performed, the core controller 423 may transmit, to the CPU 410, a command completion message along with an interrupt. That is, when the performance of all the commands including the command for the NIC 430 is completed, the accelerator 420 may transmit the command completion message to the CPU 410. Thus, it is possible to prevent an increase in unnecessary latency by restricting participation of the CPU 410 until an operation result is obtained from the accelerator 420 and transmitted to another device.

In an example, the accelerator 420 and the NIC 430 may be present on different boards, and the CPU 410, the accelerator 420, and the NIC 430 may be connected through peripheral component interconnect express (PCIe) interconnection. Through communication between the CPU 410 and the accelerator 420, commands may be mainly exchanged. Through communication between the accelerator 420 and the NIC 430, commands and data, for example, an operation result, may be exchanged. In addition, components in the accelerator 420 may communicate with one another on a circuit.

Through the method described above, it is possible to more rapidly transmit data to an outside of the device, and enable rapid data sharing in a distributed processing environment. For example, in a neural network learning process where updating needs to be performed through sharing gradients between servers, a learning time may be reduced through the method. By allowing the NIC 430 to externally transmit an operation result immediately when the operation result is obtained, it is possible to reduce a size of a DRAM that stores the operation result.

Referring to FIG. 5, an SSD 530 may receive a command from an accelerator 520 and perform the received command. The SSD 530 may be an example of a storage device included in a device. However, examples are not limited to the SSD 530, and various other storage devices may be applied.

The accelerator 520 may receive commands for the accelerator 520 and the SSD 530 from a CPU 510. A core controller 523 may transmit a command for the SSD 530 to the SSD 530 through an SSD controller 527 such that the command is performed. For this, the accelerator 520 may include the SSD controller 527 to control the SSD 530, and a corresponding form of command may be stored in a command queue 521. For example, an SSD command to be stored in the command queue 521 may include information associated with a position and a length of internal SSD data, and information indicating whether to read or write data.

For example, when an operation needs to be performed by reading data from the SSD 530, an SSD read command may be first performed. When the performance of the SSD read command is completed, an accelerator core 525 may immediately start the operation. Thus, it is possible to enable a rapid operation without latency due to the CPU 510.

In the command queue 521, SSD read(x, y) indicates an SSD command for reading data based on (x, y), and Conv(z, w) indicates an accelerator command for performing a convolution operation on (z, w). Next Cmnd indicates whether there is a next command, or an identifier of the next command. The commands may thus be represented in the form of a chain.
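As a non-limiting illustration, a chain of commands linked through Next Cmnd may be sketched as follows. The dictionary layout, the integer identifiers, the key names, and the function name walk_chain are hypothetical and are used only for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical command queue contents: each entry names its subject of performance,
# the command itself, and the identifier of the next command (None at the end of the chain).
commands: Dict[int, dict] = {
    0: {"subject": "ssd", "cmd": "SSD read(x, y)", "next_cmnd": 1},
    1: {"subject": "accelerator", "cmd": "Conv(z, w)", "next_cmnd": None},
}


def walk_chain(queue: Dict[int, dict], start: int) -> List[Tuple[str, str]]:
    """Follow Next Cmnd links from the start command and return the order of performance."""
    order = []
    cmd_id: Optional[int] = start
    while cmd_id is not None:
        entry = queue[cmd_id]
        order.append((entry["subject"], entry["cmd"]))
        cmd_id = entry["next_cmnd"]
    return order


order = walk_chain(commands, start=0)
```

In this sketch, the convolution is reached only after the SSD read that precedes it in the chain, which corresponds to starting the operation once the data has been read.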

Through the method described above, even when massive data is stored in a storage device, for example, the SSD 530, a rapid operation may be enabled by allowing the accelerator 520 to start an operation immediately after the data is read.

Referring to FIG. 6, a memory 630 may receive a command from an accelerator 620 and perform the received command. The memory 630 may be a memory that is accessible through PCIe communication and be, for example, an internal GPU memory or an internal CPU memory. In this example, the memory 630 may be accessible through a direct memory access (DMA) 627 included in the accelerator 620. Even when data is included in the internal GPU memory or the internal CPU memory due to a disk cache of an operating system (OS), for example, it is possible to minimize latency by applying the method described above.

FIGS. 7 and 8 are diagrams illustrating examples of a device.

Referring to FIG. 7, a device may be a server 700. The server 700 includes a CPU 710, an accelerator 720, and a peripheral device 730. The CPU 710 may transmit, to the accelerator 720, commands for the accelerator 720 and the peripheral device 730. The accelerator 720 may process the received commands according to a subject of performance of each of the commands. The peripheral device 730 may perform a command transmitted from the accelerator 720 and transmit a completion message indicating completion of the performance of the command to the accelerator 720. When all the commands are performed, the accelerator 720 may transmit a completion message indicating completion of the performance of the commands to the CPU 710.

Referring to FIG. 8, a device may be a user terminal 800. The user terminal 800 includes a CPU 810, an accelerator 820, and a peripheral device 830. Each of such components may perform corresponding operations described above with reference to FIG. 7. Although a smartphone is illustrated as an example of the user terminal 800 in FIG. 8 for the convenience of description, various computing devices such as a personal computer (PC), a tablet PC, and a laptop, various wearable devices such as a smartwatch and smart eyeglasses, various home appliances such as a smart speaker, a smart television (TV), and a smart refrigerator, a smart vehicle, a smart kiosk, an Internet of things (IoT) device, and the like may be applied without restriction.

FIG. 9 is a flowchart illustrating an example of a method of operating an accelerator.

In the example of FIG. 9, how an accelerator operates is illustrated.

Referring to FIG. 9, in operation 910, the accelerator receives, from a CPU, commands for the accelerator and a peripheral device of the accelerator. In operation 920, the accelerator processes the received commands according to a subject of performance of each of the commands. In operation 930, when performance of the commands is completed, the accelerator transmits, to the CPU, a completion message indicating the completion of the performance of the commands.
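As a non-limiting illustration, operations 910 through 930 may be sketched as follows. The function names operate_accelerator and fake_peripheral, the tuple format of the commands, and the log strings are hypothetical; the peripheral device is modeled as a plain callable that returns its own completion message.

```python
from typing import Callable, List, Sequence, Tuple


def operate_accelerator(commands: Sequence[Tuple[str, str]],
                        peripheral: Callable[[str], str]) -> List[str]:
    """Process commands by subject of performance and report completion once at the end.

    commands   -- (subject, command) pairs received from the CPU (operation 910)
    peripheral -- stand-in for the peripheral device; returns a completion message
    """
    log = []
    for subject, cmd in commands:                     # operation 920
        if subject == "accelerator":
            log.append(f"accelerator performed {cmd}")
        else:
            # The command is handed to the peripheral device without involving the CPU.
            log.append(peripheral(cmd))
    log.append("completion message -> CPU")           # operation 930
    return log


def fake_peripheral(cmd: str) -> str:
    return f"peripheral performed {cmd}"


log = operate_accelerator(
    [("accelerator", "Conv(z, w)"), ("peripheral", "1.2.3.4/12")],
    fake_peripheral,
)
```

In this sketch, the CPU-facing completion message is appended exactly once, after every command has been handled.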

For a more detailed description of the operations described above with reference to FIG. 9, reference may be made to what is described above with reference to FIGS. 1a through 8.

The accelerator, the device including the accelerator, and other apparatuses, units, modules, devices, and other components described herein with respect to FIGS. 1b and 3-8 are implemented by hardware components. Examples of hardware components that may be used to perform the operations described in this application where appropriate include controllers, sensors, generators, drivers, memories, comparators, arithmetic logic units, adders, subtractors, multipliers, dividers, integrators, and any other electronic components configured to perform the operations described in this application. In other examples, one or more of the hardware components that perform the operations described in this application are implemented by computing hardware, for example, by one or more processors or computers. A processor or computer may be implemented by one or more processing elements, such as an array of logic gates, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a programmable logic controller, a field-programmable gate array, a programmable logic array, a microprocessor, or any other device or combination of devices that is configured to respond to and execute instructions in a defined manner to achieve a desired result. In one example, a processor or computer includes, or is connected to, one or more memories storing instructions or software that are executed by the processor or computer. Hardware components implemented by a processor or computer may execute instructions or software, such as an operating system (OS) and one or more software applications that run on the OS, to perform the operations described in this application. The hardware components may also access, manipulate, process, create, and store data in response to execution of the instructions or software. For simplicity, the singular term “processor” or “computer” may be used in the description of the examples described in this application, but in other examples multiple processors or computers may be used, or a processor or computer may include multiple processing elements, or multiple types of processing elements, or both. For example, a single hardware component or two or more hardware components may be implemented by a single processor, or two or more processors, or a processor and a controller. One or more hardware components may be implemented by one or more processors, or a processor and a controller, and one or more other hardware components may be implemented by one or more other processors, or another processor and another controller. One or more processors, or a processor and a controller, may implement a single hardware component, or two or more hardware components. A hardware component may have any one or more of different processing configurations, examples of which include a single processor, independent processors, parallel processors, single-instruction single-data (SISD) multiprocessing, single-instruction multiple-data (SIMD) multiprocessing, multiple-instruction single-data (MISD) multiprocessing, and multiple-instruction multiple-data (MIMD) multiprocessing.

The methods illustrated in FIGS. 1b-9 that perform the operations described in this application are performed by computing hardware, for example, by one or more processors or computers, implemented as described above executing instructions or software to perform the operations described in this application that are performed by the methods. For example, a single operation or two or more operations may be performed by a single processor, or two or more processors, or a processor and a controller. One or more operations may be performed by one or more processors, or a processor and a controller, and one or more other operations may be performed by one or more other processors, or another processor and another controller. One or more processors, or a processor and a controller, may perform a single operation, or two or more operations.

Instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above are written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the processor or computer to operate as a machine or special-purpose computer to perform the operations performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the processor or computer, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the processor or computer using an interpreter. Programmers of ordinary skill in the art can readily write the instructions or software based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations performed by the hardware components and the methods as described above.

The instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, are recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, Blu-ray or optical disk storage, hard disk drive (HDD), solid state drive (SSD), flash memory, a card type memory such as multimedia card micro or a card (for example, secure digital (SD) or extreme digital (XD)), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and to provide the instructions or software and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the instructions.

While this disclosure includes specific examples, it will be apparent to one of ordinary skill in the art that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.

What is claimed is:
 1. A method of operating an accelerator, comprising: receiving, from a central processing unit (CPU), commands for the accelerator and a peripheral device of the accelerator; processing the received commands according to a subject of performance of each of the commands; and after performance of the commands is completed, transmitting a completion message indicating the completion of the performance of the commands to the CPU.
 2. The method of claim 1, wherein processing the commands comprises: in a case in which a subject to perform a target command, from among the commands, is the accelerator, performing the target command in the accelerator; and in a case in which the subject to perform the target command is the peripheral device, transmitting the target command to the peripheral device.
 3. The method of claim 2, wherein, in the case in which the subject to perform the target command is the peripheral device, after the target command is transmitted from the accelerator to the peripheral device, the target command is performed in the peripheral device without intervention of the CPU.
 4. The method of claim 2, wherein, in the case in which the subject to perform the target command is the peripheral device, after the target command is performed in the peripheral device, a completion message indicating completion of the performance of the target command is transmitted from the peripheral device to the accelerator.
 5. The method of claim 1, wherein processing the commands comprises: after performance of a first command, from among the commands, is completed, processing a second command, which is a subsequent command of the first command, according to a corresponding subject of performance of the second command.
 6. The method of claim 1, wherein the accelerator is comprised in a device together with the CPU, and configured to perform a neural network-based inference among processes that are capable of being processed in the device.
 7. The method of claim 1, wherein at least one command for the accelerator, from among the commands, is a command for performing at least one neural network-based operation in the accelerator.
 8. The method of claim 1, wherein the commands received from the CPU are stored in a command queue comprised in the accelerator based on one or both of a dependency and a performance order.
 9. The method of claim 1, wherein processing the commands comprises: in a case in which a subject to perform a target command, from among the commands, is a peripheral device configured to communicate with another device, transmitting, to the peripheral device, a processing result of the accelerator together with the target command, wherein the target command includes connection information associated with a connection with the other device.
 10. The method of claim 1, wherein processing the commands comprises: in a case in which a subject to perform a target command, from among the commands, is a storage device, transmitting, to the storage device, the target command including information indicating whether to read data from the storage device or write data in the storage device.
 11. The method of claim 1, wherein the accelerator, the CPU, and the peripheral device are configured to transmit or receive a data signal and/or a control signal through peripheral component interconnect express (PCIe) communication in a same device.
 12. A non-transitory computer-readable storage medium storing commands that, when executed by a processor, cause the processor to perform the method of claim 1.
 13. An accelerator comprising: a core controller configured to transmit, to an accelerator controller or a peripheral device controller, a target command to be performed, from among commands received from a central processing unit (CPU) and stored in a command queue according to a subject of performance of each command; the accelerator controller configured to receive, from the core controller, the target command in a case in which the accelerator is a subject to perform the target command, and perform the received target command; and the peripheral device controller configured to receive, from the core controller, the target command in a case in which a peripheral device is the subject to perform the target command, and transmit the received target command to the peripheral device, wherein, after performance of the commands stored in the command queue is completed, the core controller is configured to transmit, to the CPU, a completion message indicating completion of the performance of the commands.
 14. The accelerator of claim 13, wherein, after performance of a first command, from among the commands, is completed, the core controller is configured to transmit, to the accelerator controller or the peripheral device controller, a second command, which is a subsequent command of the first command according to a corresponding subject of performance of the second command.
 15. The accelerator of claim 13, wherein the accelerator is comprised in a device together with the CPU, and is configured to perform a neural network-based inference among processes that are capable of being processed in the device.
 16. The accelerator of claim 13, wherein the command queue is comprised in the accelerator and the commands received from the CPU are stored in the command queue based on one or both of a dependency and a performance order.
 17. The accelerator of claim 13, wherein the accelerator, the CPU, and the peripheral device are configured to transmit or receive a data signal and/or a control signal through peripheral component interconnect express (PCIe) communication in a same device.
 18. A device comprising: a central processing unit (CPU) configured to transmit, to an accelerator, commands for the accelerator and a peripheral device; the accelerator configured to perform a target command, from among the commands, in a case in which the target command is to be performed by the accelerator, and transmit, to the peripheral device, the target command in a case in which the target command is to be performed by the peripheral device; and the peripheral device configured to perform the target command received from the accelerator in the case in which the target command is to be performed by the peripheral device, wherein, after performance of the commands is completed, the accelerator is configured to transmit, to the CPU, a completion message indicating completion of the performance of the commands.
 19. The device of claim 18, wherein, in the case in which the target command is to be performed by the peripheral device, after the target command is transmitted from the accelerator to the peripheral device, the target command is performed in the peripheral device without intervention of the CPU.
 20. An accelerator, comprising: one or more processors configured to: receive, from a central processing unit (CPU), one or more commands to be processed; determine, for each of the commands, whether the respective command is to be processed by the one or more processors or to be processed by a peripheral device; in a case in which it is determined that the respective command is to be processed by the one or more processors, process the respective command; in a case in which it is determined that the respective command is to be processed by the peripheral device, transmit the command to the peripheral device; and transmit a completion message to the CPU only after it is confirmed that all of the commands have been processed.
 21. The accelerator according to claim 20, wherein the one or more processors are configured to determine, for each of the commands, whether the respective command is to be processed by the one or more processors or to be processed by the peripheral device based on whether the respective command requires performance of a neural network-based inference or calculation of a gradient for neural network learning.
 22. The accelerator according to claim 20, wherein the one or more processors are configured to transmit, in the case in which it is determined that the respective command is to be processed by the peripheral device, a processing result of the one or more processors to the peripheral device, in a case in which it is determined that the processing result is needed for the peripheral device to perform the respective command.
 23. The accelerator according to claim 20, wherein the one or more processors are configured to transmit the completion message to the CPU after receiving a command completion message from the peripheral device indicating that all of the commands to be processed by the peripheral device have been processed by the peripheral device.